MY472 Summative 3 Project

What, if any, characteristics and factors discriminate MPs who tend to ask questions about economic issues from MPs who tend to ask questions about health and welfare issues?

Author

Candidate 30552

Link to public GitHub repository: https://github.com/lenmetson/my472-final-assignment

Words: 722

Introduction

Parliamentary questions are a critical part of how MPs represent their constituencies. I therefore ask which characteristics of the constituency an MP represents drive their focus on economic or health and welfare topics in 2023.

Data

I drew on data from two sources: the UK Parliament API (“API”) and the UK House of Commons constituency dashboard (“dashboard”). I limit my analysis to questions asked in 2023. To store the data efficiently, I used a local relational database.
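The storage pattern can be sketched with a minimal example (an in-memory SQLite database and a toy row, both purely illustrative; the actual analysis writes to a database file on disk):

```r
library(DBI)
library(RSQLite)

# Connect to an in-memory SQLite database (the real project uses a file)
db <- dbConnect(RSQLite::SQLite(), ":memory:")

# Write a toy table and read a summary back out
dbWriteTable(db, "questions", data.frame(question_id = "1", question_text = "example"))
out <- dbGetQuery(db, "SELECT COUNT(*) AS n FROM questions")

dbDisconnect(db)
```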

API

First, I pulled the text of questions from the API oral and written question endpoints. For oral questions, the API only returns questions asked in the House of Commons (HoC). However, the written question endpoint returns questions from both the House of Lords (HoL) and the HoC. Therefore, I added a parameter to the request URL to only return written questions asked by members of the HoC. I merged both types of question into one table: questions.

For each question, I wanted to pull in additional data about the MP who asked it and the minister it was addressed to. Some MP characteristics, such as party affiliation or seat, change over time. The API members endpoint allows queries to specify a date and returns member data as it stood on that date. I therefore queried the API for each unique MP-date pair from my questions table.

However, I did not want to write out a table containing a row for every MP on every day they asked a question. I therefore grouped the cleaned response table by each unique combination of MP and characteristics and summarised the earliest and latest dates for which that combination was valid.
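The grouping step above can be sketched with toy data (the member IDs, parties, and dates here are hypothetical):

```r
library(dplyr)

# Toy daily snapshots: one MP who switches party mid-year
snapshots <- data.frame(
  member_id = c("1", "1", "1"),
  party     = c("Lab", "Lab", "Con"),
  date      = c("2023-01-05", "2023-03-10", "2023-06-01")
)

# Collapse repeated snapshots into one validity range per
# unique MP-characteristics combination
ranges <- snapshots %>%
  group_by(member_id, party) %>%
  summarise(
    date_valid_min = min(date),
    date_valid_max = max(date),
    .groups = "drop"
  )
```

The three daily rows collapse to two rows, one per party spell, each carrying the earliest and latest date that combination was observed.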

Dashboard

The data on the demographics of constituencies from the UK Parliament API is very limited. Therefore, I used the Commons Library constituency dashboard to add demographic variables. This data source does not have an API endpoint and requires each constituency to be looked up using a search tool. Therefore, I used Selenium to interactively scrape the data.

To create the constituencies table, I merged the scraped data with election results and shapefiles pulled from the API.

Measurement

To measure whether a question is about (1) economic issues or (2) health and welfare, I use a simple dictionary approach. Whilst limited compared to machine-learning classification approaches, dictionary string matching was more feasible for this project because it does not require expert labelling of a training set. I used the pre-defined policy topic dictionaries from Lexicoder (Albaugh et al., 2013). I wrote out the results of my measurement to a table called question_topics.

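The dictionary matching can be sketched as follows; the keywords here are illustrative stand-ins, not the actual Lexicoder dictionary entries:

```r
# Toy dictionaries (hypothetical terms; the real analysis uses Lexicoder)
econ_dict   <- c("inflation", "taxation", "gdp")
health_dict <- c("nhs", "hospital", "waiting list")

classify_question <- function(text) {
  text <- tolower(text)
  # A question matches a topic if any dictionary term appears in it
  econ   <- any(sapply(econ_dict,   function(w) grepl(w, text, fixed = TRUE)))
  health <- any(sapply(health_dict, function(w) grepl(w, text, fixed = TRUE)))
  if (econ && !health) "economic"
  else if (health && !econ) "health_welfare"
  else "other"
}
```

A question matching terms from both dictionaries, or neither, falls into a residual "other" category.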

Final database

This resulted in 5 tables in my local database:

  1. questions
  2. members
  3. constituencies
  4. parties
  5. question_topics

Analysis

To answer which factors about an MP’s constituency discriminate whether they ask questions on economic or health & welfare topics, I construct a measure of slant at the MP level, defined as:

\[ \text{slant} = \frac{N(\text{economic})}{N(\text{questions})} - \frac{N(\text{health\_welfare})}{N(\text{questions})} \]

This measure ranges from -1 to 1. A value of -1 indicates a perfect health and welfare slant (the MP only asks health & welfare questions), and a value of 1 indicates a perfect economic slant.
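A worked example of the measure, using hypothetical counts for one MP who asks 6 economic, 2 health & welfare, and 2 other questions:

```r
# Hypothetical question counts for a single MP
n_econ   <- 6
n_health <- 2
n_total  <- 10

# Slant: share of economic questions minus share of health & welfare questions
slant <- n_econ / n_total - n_health / n_total
slant  # 0.4, a moderate economic slant
```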

Distribution of slant variable for oral and written questions asked by MPs in 2023. Black line indicates where there is no slant.

Figure 1 shows that most MPs have no slant and that there is a slight skew towards health & welfare questions.

Figure 2 shows the geographic distribution of economic slant across the UK. Whilst there is no clear regional pattern, small clusters of constituencies, such as in North Wales and the North West of England, appear slightly more focused on health & welfare.

Map showing the geographic distribution of question slant. A slant score of -1 indicates all questions were about health and welfare issues, a score of 1 indicates all questions were about economic issues. Constituencies shown as white are where the MP asked no oral or written questions in 2023.

To explore this further, Figure 3 shows that population density is correlated with a stronger focus on health & welfare relative to economic issues. This suggests that how urban a constituency is partly drives which topics its MP focuses on in their questions.

A slant score of -1 indicates all questions were about health and welfare issues; a score of 1 indicates all questions were about economic issues.

Figure 4 shows the correlation between slant and the number of households claiming universal credit (UC) in the constituency. Surprisingly, there is no overall relationship between the prevalence of UC claimants and a focus on health and welfare issues.

Finally, Figure 6 shows that the slant of an MP’s questions is uncorrelated with the marginality of their seat.

A slant score of -1 indicates all questions were about health and welfare issues; a score of 1 indicates all questions were about economic issues. Seat marginality is measured as the average majority that an MP’s party won in the previous four elections in the constituency the MP currently represents. Higher values therefore indicate safer seats.

Overall, I find that the characteristics of an MP’s constituency do little to discriminate whether they focus on economic or health & welfare issues.

References

Albaugh, Quinn, Julie Sevenans and Stuart Soroka. 2013. Lexicoder Topic Dictionaries, June 2013 versions, McGill University, Montreal, Canada.

Code appendix

knitr::opts_chunk$set(
    echo = FALSE, 
    eval = FALSE,
    message = FALSE, 
    warning = FALSE, 
    error = FALSE, 
    fig.width=7, fig.height=6,
    fig.align='center') 

# Can also be run by sourcing scripts/00_setup.R

# Define function to install or load packages
load_packages <- function(x) {
  y <- x %in% rownames(installed.packages())
  if (any(!y)) install.packages(x[!y])
  invisible(lapply(x, library, character.only = TRUE))
}

# Load required packages
load_packages(c(
    "tidyverse",
    "here",
    # Database management
    "DBI",
    "RSQLite",
    # APIs and webscraping
    "httr",
    "RSelenium",
    # Text analysis 
    "tm", 
    "XML",
    # Geospatial plots
    "tmap", 
    "sf", 
    "ggrepel", 
    # Random forests
    "parallel",
    "ranger",
    "tidymodels",
    "vip", 
    "rpart", 
    "rpart.plot", 
    # Plotting 
    "gridExtra"
    ))

replace_null_with_na <- function(x) {
  if (is.list(x)) { # Checks for whether the item is a sublist 
    lapply(x, replace_null_with_na) # if it is, apply the function for each of the elements within the sublist
  } else { # If it isn't, simply apply the main function
    ifelse(is.null(x) || x == "null", "NA", x) 
  }
} 

replace_na_chr <- function(df) { # NOTE function adapted from ChatGPT output
  df <- df %>%
    mutate(across(where(is.character), ~na_if(., "NA")))
 
  return(df)
}

db_table_check <- function(table){
  rows <- dbGetQuery(db, paste0("SELECT COUNT(1) FROM ", table))
  cols <- dbListFields(db, table)
  
  result = list(
    table = table,
    n_rows = rows[[1]],
    col_names = cols)
  return(result)
}

db <- DBI::dbConnect(RSQLite::SQLite(), here("data/parliament_database.sqlite"))


# This code can also be run by sourcing scripts/01_pull-oral-questions.R 

GET_qs <- function(endpoint_url, n_skip = 0) {
  url <- paste0(
    endpoint_url,
    "?parameters.skip=",
    n_skip,
    "&parameters.answeringDateStart=2023-01-01&parameters.answeringDateEnd=2023-12-31", # Limit to 2023
    "&parameters.take=100")

  response <-
    httr::GET(url) %>%
    httr::content("parsed") # Use :: because tm masks content 

  return(response)
}

# Define functions to pull all questions

pull_all_oral_qs <- function(endpoint_url){

  # Calculate how many questions are in the end point
  n_resp <- httr::GET(paste0(
    endpoint_url,
    "?parameters.answeringDateStart=2023-01-01&parameters.answeringDateEnd=2023-12-31", # Limit to 2023
    "&parameters.take=1")) %>%
    httr::content("parsed")
  n <- n_resp$PagingInfo$GlobalTotal

  # Questions can be pulled in batches of 100,
  # calculate how many times we will have to pull
  n_loops <- ceiling(n / 100)

  print(paste0("LOG | ", Sys.time(), " | Oral question pull starting"))

  for (i in 1:n_loops) {

    n_skip <- (i - 1) * 100 # Skip however many 100s the loop has run

    if (i == 1) { # On first iteration, make new list

      response <- GET_qs(endpoint_url, n_skip)
      response <- response$Response

    } else { # On all other iterations, append to existing list

      response_new <- GET_qs(endpoint_url, n_skip)
      response_new <- response_new$Response
      response <- c(response, response_new) # Merge responses

    }

    print(paste0("LOG | ", Sys.time(), " | ", i, " of ", n_loops, " done.")) # Print progress message
    Sys.sleep(1) # Sleep to avoid hammering the API

  }

  print(paste0("LOG | ", Sys.time(), " | Oral question pull done :)"))
  return(response)
}

## APPLY FUNCTIONS

oral_questions <- pull_all_oral_qs(
  "https://oralquestionsandmotions-api.parliament.uk/oralquestions/list")

saveRDS(oral_questions, "data/oral_questions_2023.RDS")


# This code can also be run by sourcing scripts/02_pull-written-questions.R

GET_qs_written <- function(endpoint_url, n_skip = 0) {
  url <- paste0(
    endpoint_url,
    "?skip=",
    n_skip,
    "&tabledWhenFrom=2023-01-01&tabledWhenTo=2023-12-31", # Limit to 2023
    "&house=Commons", # Limit to HoC 
    "&take=100")   
  
  response <-
    httr::GET(url) %>%
    httr::content("parsed") # Use :: because tm masks content 

  return(response)
}

pull_all_written_qs <- function(endpoint_url){
  
  n_resp <- httr::GET(
    paste0(
      endpoint_url, 
      "?tabledWhenFrom=2023-01-01&tabledWhenTo=2023-12-31", # Limit to 2023
      "&house=Commons", # Limit to HoC
      "&take=1")) %>% 
    httr::content("parsed")

    n <- n_resp$totalResults

    # Questions can be pulled in batches of 100, calculate how many times we will have to pull
    n_loops <- ceiling(n/100)

    for(i in 1:n_loops){

        n_skip <- (i-1)*100 # Skip however many 100s the loop has run

        if(i==1){ # On first iteration, make new list

            response <- GET_qs_written(endpoint_url, n_skip)
            response <- response$results

        } else { # On all other iterations, append to existing list 
            responseNew <- GET_qs_written(endpoint_url, n_skip)
            responseNew <- responseNew$results

            response <- c(response, responseNew) # Merge responses 
        }

    print(paste0("LOG | Written questions | ", Sys.time(), " | ", i, " of ", n_loops, " done.")) # Print progress message

    Sys.sleep(0.5) # Sleep to avoid hammering the API
    }

 print(paste0("LOG | Written questions | ", Sys.time(), " | Written question pull done :)"))
 return(response)

}


written_questions <- pull_all_written_qs("https://questions-statements-api.parliament.uk/api/writtenquestions/questions")


saveRDS(written_questions, "data/written_questions_2023.RDS")


oral_questions <- readRDS("data/oral_questions_2023.RDS")


for (i in seq_along(oral_questions)) { 
   # remove sublists, otherwise names do not match
    oral_questions[[i]]$AskingMember <- NULL
    oral_questions[[i]][["AnsweringMinister"]] <- NULL
    
    oral_questions[[i]] <- replace_null_with_na(oral_questions[[i]])

  if (i == 1){
    oral_question_df <- data.frame(oral_questions[[i]])
  } else {
    oral_question_df2 <- data.frame(oral_questions[[i]])
    oral_question_df <- rbind(oral_question_df, oral_question_df2)
  }
}

rm(oral_question_df2, i)

### Clean dataframes and merge into one table ####

oral_question_df <- oral_question_df %>%
  select(
    question_id = Id,
    question_text = QuestionText,
    asking_member = AskingMemberId,
    question_tabled_when = TabledWhen,
    question_answering_when = AnsweringWhen,
    question_answering_body = AnsweringBody,
    question_answering_body_id = AnsweringBodyId,
    answering_member = AnsweringMinisterId) %>%
  # Ensure variables are the correct class 
  mutate(
    question_id = as.character(question_id),
    question_text = as.character(question_text),
    asking_member = as.character(asking_member),
    question_tabled_when = as.character(as.Date(question_tabled_when)),
    question_answering_when = as.character(as.Date(question_answering_when)),
    question_answering_body = as.character(question_answering_body),
    question_answering_body_id = as.character(question_answering_body_id),
    answering_member = as.character(answering_member),
    oral_written = "oral") # add written_oral dummy


written_questions <- readRDS("data/written_questions_2023.RDS")

for (i in seq_along(written_questions)) { 
    
  # Remove links sublist by keeping only "value" 
  written_questions[[i]] <- written_questions[[i]]$value

  # remove sublists, otherwise names do not match
  written_questions[[i]]$groupedQuestions <- NULL
  written_questions[[i]]$attachments <- NULL
  written_questions[[i]]$groupedQuestionsDates <- NULL
  # Replace nulls with NAs
  written_questions[[i]] <- replace_null_with_na(written_questions[[i]])

  if (i == 1){
    written_question_df <- data.frame(written_questions[[i]])
  } else {
    written_question_df2 <- data.frame(written_questions[[i]])
    written_question_df <- rbind(written_question_df, written_question_df2)
  }

  print(paste0(i, " of ", length(written_questions)))
}

rm(written_question_df2, i)

written_question_df <- written_question_df %>%
  select(
    question_id = id,
    question_text = questionText,
    asking_member = askingMemberId,
    question_tabled_when = dateTabled,
    question_answering_when = dateForAnswer,
    question_answering_body = answeringBodyName,
    question_answering_body_id = answeringBodyId,
    answering_member = answeringMemberId) %>%
  # Ensure variables are the correct class 
  mutate(
    question_id = as.character(question_id),
    question_text = as.character(question_text),
    asking_member = as.character(asking_member),
    question_tabled_when = as.character(as.Date(question_tabled_when)),
    question_answering_when = as.character(as.Date(question_answering_when)),
    question_answering_body = as.character(question_answering_body),
    question_answering_body_id = as.character(question_answering_body_id),
    answering_member = as.character(answering_member),
    oral_written = "written") # add written_oral dummy


question_df <- rbind(oral_question_df, written_question_df)

dbWriteTable(db, "questions", question_df, overwrite = TRUE) 

rm(oral_question_df, written_question_df, question_df)

# This code can also be run by sourcing scripts/03_pull-members-endpoint.R 

pull_members <- function(base_url, df) {

  for (i in seq_along(df$member_id)) {

    url <- paste0( # Build request URL
      base_url, "/",
      df$member_id[i],
      "?detailsForDate=",
      df$question_tabled_when[i])

    if (i == 1) { # If 1st iteration, create response,

      response <- httr::GET(url) %>% httr::content("parsed") # Pull request
      response <- response[1] # Extract list with response
      response <- c(
        date = df$question_tabled_when[i], response[[1]]) # Merge with date
      response <- list(response) # Convert to list

    } else { #  else create response2, then merge
      response_new <- httr::GET(url) %>% httr::content("parsed")
      response_new <- response_new[1]
      response_new <- c(
        date = df$question_tabled_when[i], response_new[[1]])
      response_new <- list(response_new)

      response <- c(response, response_new) # Merge responses
    }

    Sys.sleep(1)

    print(paste0("LOG Member Pull | ", Sys.time(), " | ", i, " of ", nrow(df), " done"))
  }
  return(response)
}

# Query question table to get MP-date pairs

members_asking <- dbGetQuery(db,
  "
  SELECT 
    asking_member AS member_id, 
    question_tabled_when
  FROM questions
  ")  %>%
  replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.

ministers_answering <- dbGetQuery(db,
  "
  SELECT 
    answering_member AS member_id,
    question_tabled_when
  FROM questions
  ")  %>%
  replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.

q_parameters <- rbind(members_asking, ministers_answering)

# Only keep unique MP-date pairs to avoid pulling the same information twice
q_parameters <- unique(q_parameters) %>%
  filter(member_id != 0) # Remove 0s because these indicate no minister has answered

# Apply function to pull members 
members <- pull_members(
  "https://members-api.parliament.uk/api/Members",
  q_parameters)

saveRDS(members, "data/members_raw.Rds")

members <- readRDS("data/members_raw.Rds") 

# Replace "null" values with NA so they are kept in the structure of the list
members <-  lapply(
  members, 
  function(x) {lapply(x, replace_null_with_na)})

# Convert to dataframe
for (i in seq_along(members)) {
    if (i == 1) { 
      members_df <- members[i] %>%
        unlist() %>%
        t() %>%
        data.frame()
    } else {
      members_df_new <- members[i] %>%
        unlist() %>%
        t() %>%
        data.frame()
      members_df <- rbind(members_df, members_df_new)
    }
}

members_df <- members_df %>%
  select(
    member_date_valid = date, 
    member_id = id, 
    name_display = nameDisplayAs, 
    gender = gender, 
    latest_constituency = latestHouseMembership.membershipFromId,
    latest_party_id = latestParty.id
  )

# Some MP characteristics change over time, so we collected unique MP-day queries. 
# However, characteristics do not change daily so there is a lot of repetition.
# The following code groups MPs by the mutable variables (i.e. unique combinations, 
# then summarises the earliest valid, and the latest valid date)
# Before this function there are 4225 observations, and after, only 482.  

members_df_grouped <- members_df %>%
  group_by( # Group by all variables apart from date
    member_id, 
    name_display, 
    gender, 
    latest_constituency,
    latest_party_id
  ) %>%
  summarize( # Summarise the earliest and latest dates this combination is valid for
    member_date_valid_min = min(member_date_valid), 
    member_date_valid_max = max(member_date_valid)
  ) %>%
  mutate(
    member_id = as.character(member_id), 
    name_display = as.character(name_display), 
    gender = as.character(gender), 
    latest_constituency = as.character(latest_constituency),
    latest_party_id = as.character(latest_party_id), 
    member_date_valid_min = as.character(member_date_valid_min), 
    member_date_valid_max = as.character(member_date_valid_max)
  )

#unique(members_df$member_id) %>% length() # This returns 474, indicating there are changes 

dbWriteTable(db, "members", members_df_grouped, overwrite = TRUE)


# This code can also be run by sourcing scripts/04_pull-constituency-endpoints.R 

MPs <- dbGetQuery(db,
  "
  SELECT * 
  FROM members 
  ")  %>%
  replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values, this function converts them back to NAs that R can recognise.

constituencies <- MPs$latest_constituency %>% 
  unique()

constituencies <- 
  data.frame(
    constituency_id = constituencies
  ) %>%
  mutate(

    cons_name = NA, 
    cons_start_date = NA, 
    cons_end_date = NA,

    last_election_1_electorate = NA,
    last_election_1_turnout = NA,
    last_election_1_majority = NA, 
    last_election_1_result = NA, 
    last_election_1_winning_party = NA,
    last_election_1_election_ID = NA,
    last_election_1_electionDate = NA,
    last_election_1_isGeneralElection = NA,

    last_election_2_electorate = NA,
    last_election_2_turnout = NA,
    last_election_2_majority = NA, 
    last_election_2_result = NA, 
    last_election_2_winning_party = NA,
    last_election_2_election_ID = NA,
    last_election_2_electionDate = NA,
    last_election_2_isGeneralElection = NA,

    last_election_3_electorate = NA,
    last_election_3_turnout = NA,
    last_election_3_majority = NA, 
    last_election_3_result = NA, 
    last_election_3_winning_party = NA,
    last_election_3_election_ID = NA,
    last_election_3_electionDate = NA,
    last_election_3_isGeneralElection = NA,

    last_election_4_electorate = NA,
    last_election_4_turnout = NA,
    last_election_4_majority = NA, 
    last_election_4_result = NA, 
    last_election_4_winning_party = NA,
    last_election_4_election_ID = NA,
    last_election_4_electionDate = NA,
    last_election_4_isGeneralElection = NA,

    shapefile = NA
    )

### Pull basic details 

pull_const_info <- function(cons_id) {
  url <- paste0(
    "https://members-api.parliament.uk/api/Location/Constituency/",
    cons_id)

    basic_info <- httr::GET(url) %>%
    httr::content("parsed")

    return(basic_info)
}

for(i in seq_along(constituencies$constituency_id)) {
  response <- pull_const_info(constituencies$constituency_id[i])
  response <- response[[1]]

  constituencies$cons_name[i] <- response$name
  constituencies$cons_start_date[i] <- response$startDate
  constituencies$cons_end_date[i] <- ifelse(is.null(response$endDate), NA, response$endDate)

  Sys.sleep(0.5)
  print(paste0("LOG | Constituency API call - basic | ", Sys.time(), " | ", i, " of ", length(constituencies$constituency_id), " done"))

}

### Pull shape file

get_cons_shapefile <- function(cons_id) {
  url <- paste0(
    "https://members-api.parliament.uk/api/Location/Constituency/",
    cons_id,
    "/Geometry")

  shapefile <- httr::GET(url) %>%
    httr::content("parsed")

  return(shapefile)
}

for(i in seq_along(constituencies$constituency_id)) {
  response <- get_cons_shapefile(constituencies$constituency_id[i])
  response <- response[[1]]

  constituencies$shapefile[i] <- response

  Sys.sleep(0.5)
  print(paste0("LOG | Constituency API call - shapefile | ", Sys.time(), " | ", i, " of ", length(constituencies$constituency_id), " done"))
}

### Pull election results

get_cons_election_results <- function(cons_id) {
  url <- paste0(
    "https://members-api.parliament.uk/api/Location/Constituency/",
    cons_id,
    "/ElectionResults")

  results <- httr::GET(url) %>%
    httr::content("parsed")

  return(results)
}

for (i in seq_along(constituencies$constituency_id)) {
  response <- get_cons_election_results(constituencies$constituency_id[i])
  response <- response[[1]]

  response <- lapply(response, function(lst) {lapply(lst, replace_null_with_na)})

  constituencies$last_election_1_electorate[i] <- response[[1]]$electorate
  constituencies$last_election_1_turnout[i] <- response[[1]]$turnout  
  constituencies$last_election_1_majority[i] <- response[[1]]$majority
  constituencies$last_election_1_result[i] <- response[[1]]$result
  # If no winner recorded, skip this and assign NA
  if(length(response[[1]]$winningParty) > 1) { # When there is content in the winning party sublist, the length will be greater than 1
    constituencies$last_election_1_winning_party[i] <- response[[1]]$winningParty$id
  } else {
     constituencies$last_election_1_winning_party[i] <- NA
  }
  constituencies$last_election_1_election_ID[i] = response[[1]]$electionId
  constituencies$last_election_1_electionDate[i] = response[[1]]$electionDate
  constituencies$last_election_1_isGeneralElection[i] = response[[1]]$isGeneralElection

  constituencies$last_election_2_electorate[i] <- response[[2]]$electorate
  constituencies$last_election_2_turnout[i] <- response[[2]]$turnout 
  constituencies$last_election_2_majority[i] <- response[[2]]$majority
  constituencies$last_election_2_result[i] <- response[[2]]$result
  # If no winner recorded, skip this and assign NA
  if(length(response[[2]]$winningParty) > 1) {
    constituencies$last_election_2_winning_party[i] <- response[[2]]$winningParty$id
  } else {
     constituencies$last_election_2_winning_party[i] <- NA
  }
  constituencies$last_election_2_election_ID[i] = response[[2]]$electionId
  constituencies$last_election_2_electionDate[i] = response[[2]]$electionDate
  constituencies$last_election_2_isGeneralElection[i] = response[[2]]$isGeneralElection

  constituencies$last_election_3_electorate[i] <- response[[3]]$electorate
  constituencies$last_election_3_turnout[i] <- response[[3]]$turnout  
  constituencies$last_election_3_majority[i] <- response[[3]]$majority
  constituencies$last_election_3_result[i] <- response[[3]]$result
  # If no winner recorded, skip this and assign NA
  if(length(response[[3]]$winningParty) > 1) {
    constituencies$last_election_3_winning_party[i] <- response[[3]]$winningParty$id
  } else {
     constituencies$last_election_3_winning_party[i] <- NA
  }
  constituencies$last_election_3_election_ID[i] = response[[3]]$electionId
  constituencies$last_election_3_electionDate[i] = response[[3]]$electionDate
  constituencies$last_election_3_isGeneralElection[i] = response[[3]]$isGeneralElection

  constituencies$last_election_4_electorate[i] <- response[[4]]$electorate
  constituencies$last_election_4_turnout[i] <- response[[4]]$turnout   
  constituencies$last_election_4_majority[i] <- response[[4]]$majority
  constituencies$last_election_4_result[i] <- response[[4]]$result
  # If no winner recorded, skip this and assign NA
  if(length(response[[4]]$winningParty) > 1) {
    constituencies$last_election_4_winning_party[i] <- response[[4]]$winningParty$id
  } else {
     constituencies$last_election_4_winning_party[i] <- NA
  }
  constituencies$last_election_4_election_ID[i] = response[[4]]$electionId
  constituencies$last_election_4_electionDate[i] = response[[4]]$electionDate
  constituencies$last_election_4_isGeneralElection[i] = response[[4]]$isGeneralElection

  Sys.sleep(0.5)
  
  print(paste0("LOG | Constituency API call - elections | ", Sys.time(), " | ", i, " of ", length(constituencies$constituency_id), " done"))
}

saveRDS(constituencies, "data/constituencies_api_raw.Rds")

# This code can also be run by sourcing scripts/04_selinium-scrape-HoC-dashboard.R 

# NOTE cons_hoc returns 610 not 472 because it was pulled based on constituencies in all oral questions, not just 2023

# Read in data from the constituency endpoint pull
cons <- readRDS("data/constituencies_api_raw.Rds") 

# Make new dataframe
cons <- cons %>%
  select(constituency_id, cons_name) %>%
  unique() %>% # Keep only unique constituencies
  mutate( # Initialise variables
    region_nation_hoclib23 = NA,
    population_hoclib23 = NA,
    area_hoclib23 = NA,
    age_0_29_hoclib23 = NA,
    age_30_64_hoclib23 = NA,
    age_65_plus_hoclib23 = NA, 
    uc_claimants_hoclib23 = NA, 
    median_house_price_hoclib23 = NA
  )

# Check whether constituencies have already been pulled and saved. If they have, filter out these so they are not re-scraped.
# If running for the first time, you will not be able to read in cons_hoc, so the filtering is skipped.
if (file.exists("data/hoc_library_scrape_raw.Rds")) {
  cons_hoc <- readRDS("data/hoc_library_scrape_raw.Rds")
  cons$check_already_pulled <- cons$cons_name %in% cons_hoc$cons_name
  cons <- cons %>% filter(check_already_pulled == FALSE)
} 

# Set up the Selenium browser
rD <- rsDriver(browser=c("firefox"), verbose = F, port = netstat::free_port(random = TRUE), chromever = NULL) 
driver <- rD$client

# Define a list of css selectors

# The dashboard is contained within an "iframe". 
# This allows a different html tree to be embedded within the main html of the webpage meaning any CSS paths do not point to the actual path of the webpage. 
# To do this, we need to identify the iframe and use `switchToFrame()` to identify elements on the dashboard. 

selector_list <- list()

selector_list$search_dropdown <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[1]/transform/div/div[3]/div/div/visual-modern/div/div/div[2]/div/i"

selector_list$search_box <- "/html/body/div[7]/div[1]/div/div[1]/input"

selector_list$search_result <- "/html/body/div[7]/div[1]/div/div[2]/div/div[1]/div/div/div[1]/div/span"

selector_list$region_nation <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[2]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div"

selector_list$population <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[3]/transform/div/div[3]/div/div/visual-modern/div/div/div/p[2]/span"

selector_list$area <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[5]/transform/div/div[3]/div/div/visual-modern/div/div/div/p[2]/span"

selector_list$age_0_29 <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[11]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div/div[1]"

selector_list$age_30_64 <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[13]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div/div[1]"

selector_list$age_65_plus <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[15]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div/div[1]"

selector_list$uc_claimants <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[28]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div[1]/div[1]"

selector_list$house_prices <- "/html/body/div[1]/report-embed/div/div[1]/div/div/div/div/exploration-container/div/div/docking-container/div/div/div/div/exploration-host/div/div/exploration/div/explore-canvas/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container[39]/transform/div/div[3]/div/div/visual-modern/div/div/div/div[1]/div/div/div/div/div[2]/div[1]"

constituency_dash_scraper <- function(
  constituency_name, 
  wait_base = 1 # Allows user to adjust wait lengths (e.g if running on a slow connection)
                # If you get a 'could not find element' error, try adjusting the wait time as the dashboard takes a while to load 
  ){
  # Find dropdown box and click on it 
  search_dropdown <- driver$findElement(using = "xpath", value = selector_list$search_dropdown)
  search_dropdown$clickElement()
  
  # Find search box and type constituency name
  Sys.sleep(wait_base * 2)
  search_box <- driver$findElement(using = "xpath", value = selector_list$search_box)
  #search_box$clickElement() # Do not strictly need this, but if not working try uncommenting
  search_box$clearElement()
  search_box$sendKeysToElement(list(constituency_name))

  Sys.sleep(wait_base * 4) # The search results take a while to load
  # Click on the first result to load data
  first_result <- driver$findElement(using = "xpath", value = selector_list$search_result)
  first_result$clickElement()
  
  Sys.sleep(wait_base * 4) # Wait for data to load
  
  # EXTRACT TEXT FROM ELEMENTS

  # Set defaults as NA
  region_nation_text <- NA
  population_text <- NA
  area_text <- NA
  age_0_29_text <- NA
  age_30_64_text <- NA
  age_65_plus_text <- NA
  uc_claimants_text <- NA
  house_prices_text <- NA

  # Region or nation
  tryCatch({ # Prevent loop from closing if no data available
  suppressMessages({ 
    region_nation <- driver$findElement(using = "xpath", value = selector_list$region_nation)
    region_nation_text <- region_nation$getElementText()[[1]]
    })
  }, error = function(e) {
    # Print error message, no need to assign NA as we have set NA as default
    print(paste0("Log: NA assigned for REGION/NATION at iteration ", i))
  })

  # Population 
  tryCatch({
  suppressMessages({ 
    population <- driver$findElement(using = "xpath", value = selector_list$population)
    population_text <- population$getElementText()[[1]]
    })
  }, error = function(e) {
    print(paste0("Log: NA assigned for POPULATION at iteration ", i))
  })

  # Area in sq km
  tryCatch({
  suppressMessages({ 
    area <- driver$findElement(using = "xpath", value = selector_list$area)
    area_text <- area$getElementText()[[1]]
    })
  }, error = function(e) {
    print(paste0("Log: NA assigned for AREA at iteration ", i))
  })

  # Age composition 
  tryCatch({
  suppressMessages({
    age_0_29 <- driver$findElement(using = "xpath", value = selector_list$age_0_29)
    age_0_29_text <- age_0_29$getElementText()[[1]]
    })
  }, error = function(e) {
    print(paste0("Log: NA assigned for AGE 0-29 at iteration ", i))
  })

  tryCatch({
  suppressMessages({
    age_30_64 <- driver$findElement(using = "xpath", value = selector_list$age_30_64)
    age_30_64_text <- age_30_64$getElementText()[[1]]
    })
  }, error = function(e) {
    print(paste0("Log: NA assigned for AGE 30-64 at iteration ", i))
  })

  tryCatch({
  suppressMessages({
    age_65_plus <- driver$findElement(using = "xpath", value = selector_list$age_65_plus)
    age_65_plus_text <- age_65_plus$getElementText()[[1]]
    })
  }, error = function(e) {
    print(paste0("Log: NA assigned for AGE 65 PLUS at iteration ", i))
  })

  # Universal credit claimants 
  tryCatch({
  suppressMessages({
    uc_claimants <- driver$findElement(using = "xpath", value = selector_list$uc_claimants)
    uc_claimants_text <- uc_claimants$getElementText()[[1]]
    })
  }, error = function(e) {
    print(paste0("Log: NA assigned for UC CLAIMANTS at iteration ", i))
  })

  # House price
  tryCatch({
    suppressMessages({
      house_prices <- driver$findElement(using = "xpath", value = selector_list$house_prices)
      house_prices_text <- house_prices$getElementText()[[1]]
    })
  }, error = function(e) {
    print(paste0("Log: NA assigned for HOUSE PRICE at iteration ", i))
  })

  # Collect the extracted values into a list
  results <- list(
    region_nation_text, 
    population_text, area_text, 
    age_0_29_text, age_30_64_text, age_65_plus_text,
    uc_claimants_text, house_prices_text)

  return(results)

}
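Even with tuned waits, individual Selenium lookups can fail intermittently. A hypothetical retry helper (`with_retries` is not part of the original pipeline; the name and signature are my own) sketches one way the per-field `tryCatch` boilerplate could be reduced:

```r
# Hypothetical helper, not used above: retry a flaky Selenium call a few
# times, backing off between attempts, and fall back to NA (matching the
# scraper's defaults) if every attempt fails.
with_retries <- function(fetch, attempts = 3, wait_base = 1) {
  for (attempt in seq_len(attempts)) {
    result <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(wait_base * attempt) # wait longer after each failure
  }
  NA
}
```

Each extraction block could then become, e.g., `with_retries(function() driver$findElement(using = "xpath", value = selector_list$population)$getElementText()[[1]])`.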

# Run the scraper

# Navigate to home page outside of the loop to avoid reloading each time
driver$navigate("https://commonslibrary.parliament.uk/constituency-dashboard/")

Sys.sleep(1)

# The dashboard is embedded within an iframe. Unless we "switch" to this frame, the XPath selectors will not resolve
# Identify and switch to the iframe
iframe <- driver$findElement(using = "xpath", value = "//iframe[@title='Constituency dashboard']")
driver$switchToFrame(iframe)
Sys.sleep(4)

# Set the number to start from, in case the loop is interrupted but we have cached results
start_from <- 1 

for (i in start_from:length(cons$constituency_id)) {

  results <- constituency_dash_scraper(cons$cons_name[i], wait_base = 1)

    cons$region_nation_hoclib23[i] <- results[[1]]

    cons$population_hoclib23[i] <- results[[2]]

    cons$area_hoclib23[i] <- results[[3]]

    cons$age_0_29_hoclib23[i] <- results[[4]]
    cons$age_30_64_hoclib23[i] <- results[[5]]
    cons$age_65_plus_hoclib23[i] <- results[[6]]

    cons$uc_claimants_hoclib23[i] <- results[[7]]
    cons$median_house_price_hoclib23[i] <- results[[8]]

  # Cache results collected so far, deleting the previous cache file
  saveRDS(cons, paste0("data/cache_cons_at", i, ".Rds"))
  if (i > start_from) {
    file.remove(paste0("data/cache_cons_at", i - 1, ".Rds")) # delete last cached object
  }
  
  Sys.sleep(1)

  print(paste0(i, " of ", nrow(cons), " done."))

}

# Kill driver and java processes
driver$close()
rD$server$stop()
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE) # Windows-only; on Linux/macOS this call has no effect

if (file.exists("data/hoc_library_scrape_raw.Rds")) {
  cons$check_already_pulled <- NULL
  saveRDS(cons, "data/hoc_library_scrape_raw_extra.Rds")

  cons_hoc <- readRDS("data/hoc_library_scrape_raw.Rds")
  cons <- rbind(cons_hoc, cons)

  saveRDS(cons, "data/hoc_library_scrape_raw.Rds")

} else {
  # Save output
  saveRDS(cons, "data/hoc_library_scrape_raw.Rds")
}


# Clean dashboard data

cons_hoc <- readRDS("data/hoc_library_scrape_raw.Rds")

# pop numeric 
cons_hoc$population_hoclib23 <- cons_hoc$population_hoclib23 %>%
  str_remove_all(",") %>%
  as.numeric()

# area numeric

cons_hoc$area_hoclib23 <- cons_hoc$area_hoclib23 %>%
  str_extract(".*(?=\\s*sq\\.\\s*km)") %>%
  str_remove_all(",") %>%
  as.numeric()
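The lookahead used here can be sanity-checked on a toy value (the string below is illustrative; the dashboard's exact rendering may differ slightly):

```r
# Illustrative input, assuming the dashboard renders areas like "1,234 sq. km"
raw_area <- "1,234 sq. km"

# The lookahead keeps everything before "sq. km" without consuming it,
# then the comma is stripped and the remainder coerced to numeric
extracted <- stringr::str_extract(raw_area, ".*(?=\\s*sq\\.\\s*km)")
area_num <- as.numeric(stringr::str_remove_all(extracted, ","))
area_num # 1234
```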

# age perc

cons_hoc$age_0_29_hoclib23 <- cons_hoc$age_0_29_hoclib23 %>%
  str_remove_all("%") %>%
  as.numeric()
cons_hoc$age_0_29_hoclib23 <- cons_hoc$age_0_29_hoclib23/100 # Convert to proportion 

cons_hoc$age_30_64_hoclib23 <- cons_hoc$age_30_64_hoclib23 %>%
  str_remove_all("%") %>%
  as.numeric()
cons_hoc$age_30_64_hoclib23 <- cons_hoc$age_30_64_hoclib23/100 # Convert to proportion 

cons_hoc$age_65_plus_hoclib23 <- cons_hoc$age_65_plus_hoclib23 %>%
  str_remove_all("%") %>%
  as.numeric()
cons_hoc$age_65_plus_hoclib23 <- cons_hoc$age_65_plus_hoclib23/100 # Convert to proportion 

# uc numeric

cons_hoc$uc_claimants_hoclib23 <- cons_hoc$uc_claimants_hoclib23 %>%
  str_remove_all(",") %>%
  as.numeric()

# house price numeric

cons_hoc$median_house_price_hoclib23 <- cons_hoc$median_house_price_hoclib23 %>%
  str_remove_all(",|£") %>%
  as.numeric()

# Merge API and dashboard data 

cons_api <- readRDS("data/constituencies_api_raw.Rds")

cons_hoc <- cons_hoc %>%
  select(-cons_name)

cons <- left_join(cons_api, cons_hoc, by = "constituency_id")

# Write out to database 

dbWriteTable(db, "constituencies", cons, overwrite = TRUE)


response <- httr::GET("https://members-api.parliament.uk/api/Parties/GetActive/1") %>%
  httr::content("parsed")

parties <- response$items
parties <- replace_null_with_na(parties)

# Build one row per party, then bind the rows into a single data frame
parties_df <- do.call(rbind, lapply(parties, function(party) {
  data.frame(
    party_id = party$value$id,
    party_name = party$value$name,
    party_abbreviation = party$value$abbreviation,
    party_colour = party$value$backgroundColour
  )
}))

parties_df <- parties_df %>%
  mutate(
    party_id = as.character(party_id),
    party_name = as.character(party_name), 
    party_abbreviation = as.character(party_abbreviation),
    party_colour = as.character(party_colour))

dbWriteTable(db, "parties", parties_df, overwrite = TRUE)


question_text <- dbGetQuery(db, 
  "
  SELECT question_id, question_text
  FROM questions
  "
)  %>%
  replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values; this function converts them back to NAs that R can recognise.

# Measure 

# Initialise variables as NA
question_text$is_econ <- NA
question_text$is_health_welf <- NA

# clean question text 

question_text$question_text <- question_text$question_text %>%
    tolower() %>% # Convert to lower case
    tm::removePunctuation() # remove punctuation

# Define dictionaries 

# NOTE Citation: Albaugh, Quinn, Julie Sevenans and Stuart Soroka. 2013. Lexicoder Topic Dictionaries, June 2013 versions, McGill University, Montreal, Canada.

# Download Lexicon Policy Topic Dictionaries 
if (!file.exists("data/lexicoder_dictionaries/LTDjun2013/policy_agendas_english.lcd")) {

  dir.create("data/lexicoder_dictionaries", showWarnings = FALSE)
  download.file(
    "https://www.snsoroka.com/s/LTDjun2013.zip", 
    "data/lexicoder_dictionaries/policy_topics.zip")

  unzip(
    "data/lexicoder_dictionaries/policy_topics.zip", 
    exdir = "data/lexicoder_dictionaries", overwrite = TRUE)
}
  
# NOTE: ChatGPT was used to write the code that parses the XML dictionary

parsed_string <- readLines("data/lexicoder_dictionaries/LTDjun2013/policy_agendas_english.lcd") %>%
  paste(collapse = "\n") %>%
  xmlTreeParse(useInternalNodes = TRUE) 


extract_pnodes <- function(cnode) {
  sapply(xpathApply(cnode, "./pnode"), function(pnode) {
    xmlAttrs(pnode)[["name"]]
  })
}
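As the code above implies, the `.lcd` file is XML in which each `cnode` (topic category) wraps `pnode` elements (dictionary patterns), with the content held in `name` attributes. A minimal sketch on a toy fragment (the terms are illustrative, not the actual Lexicoder entries) shows what the extraction yields:

```r
library(XML)

# Toy .lcd-style fragment mirroring the structure the parser expects
toy_xml <- '<dict>
  <cnode name="macroeconomics">
    <pnode name="inflation"/>
    <pnode name="interest rate"/>
  </cnode>
</dict>'

toy_parsed <- xmlTreeParse(toy_xml, useInternalNodes = TRUE)

# Category names, then the patterns within each category
sapply(xpathApply(toy_parsed, "//cnode"), function(cnode) xmlAttrs(cnode)[["name"]])
sapply(xpathApply(toy_parsed, "//cnode/pnode"), function(pnode) xmlAttrs(pnode)[["name"]])
```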

# Extract cnodes and their corresponding pnodes
dictionaries_output <- xpathApply(parsed_string, "//cnode", function(cnode) {
  cnode_name <- xmlAttrs(cnode)[["name"]]
  pnodes <- extract_pnodes(cnode)
  return(list(cnode_name = cnode_name, pnodes = pnodes))
})

# Convert to a single list 
dictionaries <- list()

for (item in dictionaries_output) {
  cnode_name <- item$cnode_name
  pnodes <- item$pnodes
  dictionaries[[cnode_name]] <- pnodes
}

rm(cnode_name, dictionaries_output, item, parsed_string, pnodes)


econ_dict <- c( # Select relevant dictionaries 
  dictionaries$macroeconomics,
  dictionaries$finance,
  dictionaries$foreign_trade
  )

# Convert to regex string and convert to lower for matching
econ_dict <- econ_dict %>%
  paste(collapse="|") %>%
  tolower()

health_welf_dict <- c( # Select relevant dictionaries 
  dictionaries$healthcare, 
  dictionaries$social_welfare
)

health_welf_dict <- health_welf_dict %>%
  paste(collapse="|") %>%
  tolower()
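A caveat with collapsing terms into one alternation: `str_detect` matches substrings, so a short term can fire inside an unrelated word. A sketch of a stricter variant that wraps the alternation in word boundaries (the terms below are illustrative, not the Lexicoder entries):

```r
terms <- c("tax", "trade", "inflation") # illustrative terms only

pattern_plain <- paste(terms, collapse = "|")
pattern_bound <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b")

stringr::str_detect("the syntax of the bill", pattern_plain) # TRUE: "tax" inside "syntax"
stringr::str_detect("the syntax of the bill", pattern_bound) # FALSE
stringr::str_detect("cut the tax burden", pattern_bound)     # TRUE
```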

question_text <- question_text %>%
  mutate(
    is_econ = 
      ifelse(
        str_detect(question_text, econ_dict), 1, 0),
    is_health_welf = 
      ifelse(
        str_detect(question_text, health_welf_dict), 1, 0)
  )

mean(question_text$is_econ)
mean(question_text$is_health_welf)

dbWriteTable(db, "question_topics", question_text, overwrite = TRUE)

dbListTables(db)
db_table_check("questions")
db_table_check("members")
db_table_check("constituencies")
db_table_check("parties")
db_table_check("question_topics")
analysis_df <- dbGetQuery(
  db, 
  "
  SELECT 

    members.name_display AS MP,
    parties.party_abbreviation AS party_abbreviation, 
    
    SUM(question_topics.is_econ)/COUNT(*) AS econ_prop,
    SUM(question_topics.is_health_welf)/COUNT(*) AS health_welf_prop,
    
    constituencies.uc_claimants_hoclib23 AS uc_claimants,
    constituencies.median_house_price_hoclib23 AS median_house_price,

    constituencies.population_hoclib23 / constituencies.area_hoclib23 AS density,

    constituencies.age_0_29_hoclib23 AS age_29,
    constituencies.age_30_64_hoclib23 AS age_30_64,
    constituencies.age_65_plus_hoclib23 AS age_65,

    /* Majority */ 
    constituencies.last_election_1_majority, 
    constituencies.last_election_2_majority, 
    constituencies.last_election_3_majority, 
    constituencies.last_election_4_majority,

    constituencies.last_election_1_electorate,
    constituencies.last_election_2_electorate,
    constituencies.last_election_3_electorate,
    constituencies.last_election_4_electorate,

    /* results */    
    constituencies.last_election_1_result,
    constituencies.last_election_2_result,
    constituencies.last_election_3_result,
    constituencies.last_election_4_result

  FROM questions
  
  JOIN question_topics ON questions.question_id = question_topics.question_id
  
  LEFT JOIN members ON questions.asking_member = members.member_id
    /* this has to be joined before anything */ 
    /* from members to avoid dropping rows */
    /* select row where date of question comes between the dates valid range */

    AND REPLACE(questions.question_tabled_when, '-', '') 
    /* no date class in SQLite, so convert to string*/
    
      BETWEEN REPLACE(members.member_date_valid_min, '-', '') 
        AND REPLACE(members.member_date_valid_max, '-', '')
  
  LEFT JOIN constituencies ON members.latest_constituency = constituencies.constituency_id
  LEFT JOIN parties ON parties.party_id = members.latest_party_id

  GROUP BY members.member_id
  "
  )  %>%
  replace_na_chr() # When flattening the API responses, we replaced null values with character "NA" values; this function converts them back to NAs that R can recognise.
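One subtlety in the `SUM(...)/COUNT(*)` proportions above: SQLite uses integer division when both operands are INTEGER. The topic flags were written from R doubles and so should be stored as REAL here, but if the flags were ever stored as integers the proportions would silently truncate to 0. A defensive sketch using an in-memory database:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flags", data.frame(is_econ = c(1L, 0L, 1L))) # integer column

# Integer / integer truncates: 2/3 becomes 0
trunc_p <- dbGetQuery(con, "SELECT SUM(is_econ) / COUNT(*) AS p FROM flags")$p

# Casting one operand to REAL gives the intended proportion (2/3)
real_p <- dbGetQuery(con, "SELECT CAST(SUM(is_econ) AS REAL) / COUNT(*) AS p FROM flags")$p

dbDisconnect(con)
```

Adding the `CAST` to the analysis query would make the proportions robust to the flags' storage class.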

# Convert majority variables into +/- depending on whether current MP won or lost  
analysis_df <- analysis_df %>%
  mutate(
    last_election_1_majority = 
      ifelse(str_detect(last_election_1_result, party_abbreviation), 
        last_election_1_majority, 
        last_election_1_majority * -1),
    last_election_2_majority = 
      ifelse(str_detect(last_election_2_result, party_abbreviation), 
        last_election_2_majority, 
        last_election_2_majority * -1),
    last_election_3_majority = 
      ifelse(str_detect(last_election_3_result, party_abbreviation), 
        last_election_3_majority, 
        last_election_3_majority * -1),
    last_election_4_majority = 
      ifelse(str_detect(last_election_4_result, party_abbreviation), 
        last_election_4_majority, 
        last_election_4_majority * -1)
  ) %>% 
  select(-c(last_election_1_result, last_election_2_result, last_election_3_result, last_election_4_result))

# Calculate marginality 
analysis_df <- analysis_df %>%
  mutate(
    marginality_1 = (last_election_1_majority / last_election_1_electorate),
    marginality_2 = (last_election_2_majority / last_election_2_electorate),
    marginality_3 = (last_election_3_majority / last_election_3_electorate),
    marginality_4 = (last_election_4_majority / last_election_4_electorate)
  ) %>%
  select(-c(last_election_1_majority, last_election_2_majority, last_election_3_majority, last_election_4_majority, last_election_1_electorate, last_election_2_electorate,last_election_3_electorate,last_election_4_electorate))

# Calculate mean marginality 
analysis_df <- analysis_df %>%  
  mutate(
    mean_marginality = rowMeans(select(., starts_with("marginality_")))
  ) %>%
  select(-c(marginality_1, marginality_2, marginality_3, marginality_4))

analysis_df <- analysis_df %>%
  mutate(econ_slant = econ_prop - health_welf_prop)

hist <- analysis_df %>%
  ggplot() +
  geom_vline(xintercept = 0)+
  geom_histogram(aes(x=econ_slant), alpha = 0.7) +
  xlim(-1,1)+
  labs(
    title = "Distribution of oral and written question slant", 
    x = "Slant", y = "Count") +
  theme(
    aspect.ratio = 1,
    panel.background = element_rect(fill = "white", color = "black"), 
    panel.grid = element_blank())

hist

geog_data <- dbGetQuery(db,
  "
  SELECT 
    constituencies.cons_name AS constituency, 
    
    SUM(question_topics.is_econ)/COUNT(*) AS econ_prop,
    SUM(question_topics.is_health_welf)/COUNT(*) AS health_welf_prop,

    constituencies.shapefile AS con_shapefile

  FROM questions

  LEFT JOIN members ON questions.asking_member = members.member_id
    /* this has to be joined before anything */ 
    /* from members to avoid dropping rows */
    /* select row where date of question comes between the dates valid range */

    AND REPLACE(questions.question_tabled_when, '-', '') 
    /* no date class in SQLite, so convert to string*/
    
      BETWEEN REPLACE(members.member_date_valid_min, '-', '') 
        AND REPLACE(members.member_date_valid_max, '-', '')

  LEFT JOIN constituencies ON members.latest_constituency = constituencies.constituency_id

  LEFT JOIN question_topics ON questions.question_id = question_topics.question_id
  
  GROUP BY constituencies.cons_name
  "
  )  %>%
  replace_na_chr() %>% # When flattening the API responses, we replaced null values with character "NA" values; this function converts them back to NAs that R can recognise.
  filter(!is.na(con_shapefile)) # Drop constituencies without shapefiles
 
geog_data <- geog_data %>%
  mutate(econ_slant = econ_prop - health_welf_prop)

# Add all constituencies as base map

# To plot the base map, we want all constituencies, not just those where questions have been asked.
# We can get these boundaries from Ordnance Survey's open Boundary-Line dataset
# For reproducibility, the following code downloads and processes the data programmatically
# To download manually, use https://osdatahub.os.uk/downloads/open/BoundaryLine

if (!file.exists("data/whole_UK_shapefile/Data/GB/westminster_const_region.shp")) { 

  options(timeout = 600) # The download takes some time, so increase the "timeout" setting

  dir.create("data/whole_UK_shapefile")
  download.file(
    "https://api.os.uk/downloads/v1/products/BoundaryLine/downloads?area=GB&format=ESRI%C2%AE+Shapefile&redirect", 
    "data/whole_UK_shapefile/OS_zip.zip")
  options(timeout = 60) # Reset timeout

  # Unzip only the Westminster constituency layer
  unzip(
    "data/whole_UK_shapefile/OS_zip.zip", 
    files = c(
      "Data/GB/westminster_const_region.dbf",
      "Data/GB/westminster_const_region.prj",
      "Data/GB/westminster_const_region.shp",
      "Data/GB/westminster_const_region.shx"),
    exdir = "data/whole_UK_shapefile", 
    overwrite = TRUE)

  file.remove("data/whole_UK_shapefile/OS_zip.zip")

}

basemap_sf <- 
  st_read(
    dsn = "data/whole_UK_shapefile/Data/GB/westminster_const_region.shp")

# Make shape files for constituencies in the database

# NOTE: approach to converting from GeoJSON from ChatGPT
temp_geojson <- tempfile(fileext = ".geojson") # Create a temporary file
writeLines(geog_data$con_shapefile, con = temp_geojson) # Write out to temporary file
geog_sf <- st_read(dsn = temp_geojson) # Read the GeoJSON file into an sf object
unlink(temp_geojson) # Delete temporary file

geog_data <- cbind(geog_data, geog_sf)
geog_data <- st_as_sf(geog_data) # Convert to SF



map <- 
  tm_shape(basemap_sf) +
  tm_sf(col = "white") +
  tm_shape(geog_data) +
  tm_polygons(
    col = "econ_slant",
    style = "cont",
    midpoint = 0,
    title = "Question topic slant",
    palette = "RdBu",
   #legend.hist = TRUE 
  ) +
  tm_legend(
    legend.position = c("right", "center"),
    legend.title.size = 5,
    legend.text.size = 4
  ) +
  tm_layout(
    title = "Distribution of question topic slant across UK Constituencies.\n\n", 
    title.size = 9
  )

map

lm_density <- summary(lm(econ_slant ~ density, data = analysis_df))
slope_density <- lm_density$coefficients[[2,1]]
se_density <- lm_density$coefficients[[2,2]]

plot_density <- analysis_df %>%
  ggplot(aes(x=density, y = econ_slant)) + 
  geom_point(size = 0.5, alpha = 0.5) + 
  geom_smooth(method = "lm") +
  annotate( # annotate() draws the label once; geom_text() with a constant label would redraw it for every row
    "text", x = 15000, y = 0.5, 
    label = paste0(
      "Slope: ", signif(slope_density, digits = 3),
      "\n SE: ", signif(se_density, digits = 3)
    ),
    size = 3)+  
  labs(
    title = "Correlation between population density of an MP's constituency and their question slant in 2023", 
    x = "Population density in 2023 (people per km^2)", 
    y = "Slant") +
  ylim(-1,1) +
  theme(
    panel.background = element_rect(fill = "white", color = "black"), 
    aspect.ratio = 1
  )

plot_density

# Run linear model
lm_uc <- summary(lm(econ_slant ~ uc_claimants, data = analysis_df))

# Save slope coefficients and SEs for plotting

slope_uc <- lm_uc$coefficients[[2,1]]
se_uc <- lm_uc$coefficients[[2,2]]


# Make plots

plot_uc <- analysis_df %>%
  ggplot(aes(x=uc_claimants, y = econ_slant)) + 
  geom_point(size = 0.5, alpha = 0.5) + 
  geom_smooth(method = "lm") +
  annotate( # annotate() draws the label once rather than once per row
    "text", x = 5000, y = 0.8, 
    label = paste0(
      "Slope: ", signif(slope_uc, digits = 3),
      "\n SE: ", signif(se_uc, digits = 3)
    ),
    size = 3) +  
  labs(
    title = "Correlation between the number of Universal Credit claimants in an MP's constituency and their question slant in 2023", 
    x = "Number of Universal Credit claimants", 
    y = "Slant") +  
  ylim(-1,1) +
  theme(
    aspect.ratio = 1,
    panel.background = element_rect(fill = "white", color = "black"), 
    panel.grid = element_blank()) 

plot_uc

# Run linear model 
lm_marg <- summary(lm(econ_slant ~ mean_marginality, data = analysis_df))

slope_marg <- lm_marg$coefficients[[2,1]]
se_marg <- lm_marg$coefficients[[2,2]]

# Make plot

plot_marginality <- analysis_df %>%
  ggplot(aes(x=mean_marginality, y = econ_slant)) + 
  geom_point(size = 0.5, alpha = 0.5) + 
  geom_smooth(method = "lm") +
  annotate( # annotate() draws the label once rather than once per row
    "text", x = -0.2, y = 0.5, 
    label = paste0(
      "Slope: ", signif(slope_marg, digits = 3),
      "\n SE: ", signif(se_marg, digits = 3)
    ),
    size = 3) +  
  labs(
    title = "Correlation between an MP's seat marginality and their question slant in 2023", 
    x = "Seat marginality", 
    y = "Slant") +  
  ylim(-1,1) +
  theme(
    aspect.ratio = 1,
    panel.background = element_rect(fill = "white", color = "black"), 
    panel.grid = element_blank())

plot_marginality
# Disconnect from local database
DBI::dbDisconnect(db)

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] gridExtra_2.3      rpart.plot_3.1.1   rpart_4.1.21       vip_0.4.1         
 [5] yardstick_1.2.0    workflowsets_1.0.1 workflows_1.1.3    tune_1.1.2        
 [9] rsample_1.2.0      recipes_1.0.9      parsnip_1.1.1      modeldata_1.2.0   
[13] infer_1.0.5        dials_1.2.0        scales_1.3.0       broom_1.0.5       
[17] tidymodels_1.1.1   ranger_0.16.0      ggrepel_0.9.4      sf_1.0-15         
[21] tmap_3.3-4         XML_3.99-0.16      tm_0.7-11          NLP_0.2-1         
[25] RSelenium_1.7.9    httr_1.4.7         RSQLite_2.3.4      DBI_1.1.3         
[29] here_1.0.1         lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1     
[33] dplyr_1.1.4        purrr_1.0.2        readr_2.1.4        tidyr_1.3.0       
[37] tibble_3.2.1       ggplot2_3.4.4      tidyverse_2.0.0   

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3  rstudioapi_0.15.0   jsonlite_1.8.8     
  [4] semver_0.2.0        magrittr_2.0.3      farver_2.1.1       
  [7] rmarkdown_2.25      vctrs_0.6.5         memoise_2.0.1      
 [10] base64enc_0.1-3     terra_1.7-65        htmltools_0.5.7    
 [13] leafsync_0.1.0      raster_3.6-26       parallelly_1.36.0  
 [16] KernSmooth_2.23-22  htmlwidgets_1.6.4   stars_0.6-4        
 [19] cachem_1.0.8        iterators_1.0.14    lifecycle_1.0.4    
 [22] pkgconfig_2.0.3     Matrix_1.6-3        R6_2.5.1           
 [25] fastmap_1.1.1       future_1.33.1       digest_0.6.33      
 [28] colorspace_2.1-0    furrr_0.3.1         wdman_0.2.6        
 [31] rprojroot_2.0.4     leafem_0.2.3        crosstalk_1.2.1    
 [34] labeling_0.4.3      lwgeom_0.2-13       fansi_1.0.5        
 [37] timechange_0.2.0    mgcv_1.9-0          abind_1.4-5        
 [40] compiler_4.3.2      proxy_0.4-27        bit64_4.0.5        
 [43] withr_2.5.2         backports_1.4.1     MASS_7.3-60        
 [46] lava_1.7.3          tmaptools_3.1-1     leaflet_2.2.1      
 [49] classInt_0.4-10     caTools_1.18.2      tools_4.3.2        
 [52] units_0.8-5         future.apply_1.11.1 nnet_7.3-19        
 [55] glue_1.6.2          nlme_3.1-163        grid_4.3.2         
 [58] generics_0.1.3      gtable_0.3.4        tzdb_0.4.0         
 [61] class_7.3-22        data.table_1.14.8   hms_1.1.3          
 [64] sp_2.1-2            xml2_1.3.6          utf8_1.2.4         
 [67] foreach_1.5.2       pillar_1.9.0        lhs_1.1.6          
 [70] splines_4.3.2       lattice_0.22-5      survival_3.5-7     
 [73] bit_4.0.5           tidyselect_1.2.0    knitr_1.45         
 [76] xfun_0.41           hardhat_1.3.0       timeDate_4032.109  
 [79] stringi_1.8.2       DiceDesign_1.10     yaml_2.3.7         
 [82] evaluate_0.23       codetools_0.2-19    cli_3.6.1          
 [85] munsell_0.5.0       dichromat_2.0-0.1   Rcpp_1.0.11        
 [88] globals_0.16.2      png_0.1-8           binman_0.1.3       
 [91] gower_1.0.1         assertthat_0.2.1    blob_1.2.4         
 [94] bitops_1.0-7        GPfit_1.0-8         listenv_0.9.0      
 [97] viridisLite_0.4.2   slam_0.1-50         ipred_0.9-14       
[100] prodlim_2023.08.28  e1071_1.7-14        rlang_1.1.2